双向固定效应模型中是否要控制公司年龄?
👇 连享会 · 推文导航 | www.lianxh.cn
🍎 Stata:Stata基础 | Stata绘图 | Stata程序 | Stata新命令 📘 论文:数据处理 | 结果输出 | 论文写作 | 数据分享 💹 计量:回归分析 | 交乘项-调节 | IV-GMM | 时间序列 | 面板数据 | 空间计量 | Probit-Logit | 分位数回归 ⛳ 专题:SFA-DEA | 生存分析 | 爬虫 | 机器学习 | 文本分析 🔃 因果:DID | RDD | 因果推断 | 合成控制法 | PSM-Matching 🔨 工具:工具软件 | Markdown | Python-R-Stata 🎧 课程:公开课-直播 | 计量专题 | 关于连享会
连享会 · 2022 暑期班
刘欣妍 (香港中文大学),liuxinyan@link.cuhk.edu.hk
史 柯 (中央财经大学),shike2231128@gmail.com
编者按:本文主要摘译自下文,特此致谢!
Source:Controlling for Log Age -Link-
温馨提示: 文中链接在微信中无法生效。请点击底部「阅读原文」。或直接长按/扫描如下二维码,直达原文:
目录
1. 引言
2. R 中的实现
2.1 模拟数据
2.2 估计模型
2.3 加入公司年龄的对数
3. Stata 中的实现
4. 相关推文
1. 引言
在实证分析中,我们通常需要控制某些变量来缓解遗漏变量问题,例如公司年龄 (age) 和公司规模 (size)。不过,当我们同时控制了公司固定效应和时间固定效应后,公司年龄与固定效应会存在共线性问题。为缓解上述问题,学者往往会对公司年龄取对数。那么这一方法是否可以发挥作用呢?
通过下文的分析,我们认为在实证研究中,需要将公司年龄的对数作为控制变量纳入到模型中。
2. R 中的实现
2.1 模拟数据
首先我们来模拟一个面板数据
# Install Packages
install.packages("magrittr") # package installations are only needed the first time you use it
install.packages("dplyr") # alternative installation of the %>%
install.packages("fastDummies")
library(magrittr) # needs to be run every time you start R and want to use %>%
library(dplyr)
library(tidyr)
library(fastDummies)
# simulate data
# set seed
set.seed(74792766)
# Generate data - 20,000 firms are placed in each group. Groups 3 and 4 are
# treated in 1998, Groups 1 and 2 are untreated
make_data <- function(...) {
# Fixed Effects
# get a list of 4,000 units that are randomly treated sometime in the panel
treated_units <- sample(1:20000, 4000)
# unit fixed effects
unit <- tibble(
unit = 1:20000,
unit_fe = rnorm(20000, 0, 1),
treat = if_else(unit %in% treated_units, 1, 0)) %>%
# make first and last year per unit, and treat year if treated
rowwise() %>%
mutate(first_year = sample(seq(1981, 2010), 1),
# pull last year as a randomly selected date bw first and 2010
last_year = if_else(first_year < 2010, sample(seq(first_year, 2010), 1),
as.integer(2010)),
# pull treat year as randomly selected year bw first and last if treated
treat_year = if_else(treat == 1,
if_else(first_year != last_year,
sample(first_year:last_year, 1), as.integer(first_year)),
as.integer(0))) %>%
ungroup()
# year fixed effects
year <- tibble(
year = 1981:2010,
year_fe = rnorm(30, 0, 1))
# make panel
crossing(unit, year) %>%
arrange(unit, year) %>%
# keep only if year between first and last year
rowwise() %>%
filter(year %>% between(first_year, last_year)) %>%
ungroup() %>%
# make error term, treat term and log age term
mutate(error = rnorm(nrow(.), 0, 1),
posttreat = if_else(treat == 1 & year >= treat_year, 1, 0),
rel_year = if_else(treat == 1, year - treat_year, as.integer(NA)),
tau = if_else(posttreat == 1, .2, 0),
firm_age = year - first_year,
log_age = log(firm_age + 1)) %>%
# make cumulative treatment effects
group_by(unit) %>%
mutate(cumtau = cumsum(tau)) %>%
ungroup() %>%
# finally, make dummy variables
bind_cols(dummy_cols(tibble(lag = .$rel_year), select_columns = "lag",
ignore_na = TRUE, remove_selected_columns = TRUE)) %>%
# replace equal to 0 for all lead lag columns
mutate_at(vars(starts_with("lag_")), ~replace_na(., 0))
}
# make data
data <- make_data()
得到的模拟数据的情况如下:
2.2 估计模型
在得到模拟数据后,我们假设公司年龄变量的对数 (log_age,即 year-first_year 的对数) 与结果变量和处理变量都不相关。进而,我们可以通过估计一个简单的双向固定效应模型 (Two-way fixed effect),来画出真实的处理效应与估计的处理效应。具体模型如下所示:
#Install packages
install.packages("tidyverse")
install.packages("ggplot2")
install.packages("lfe")
library(ggplot2)
library(tidyverse)
library(lfe)
#Draw plot
theme_set(theme_clean() + theme(plot.background = element_blank()))
# first make a plot where log age has no impact on the outcome variable
# get var names in a vector - need to drop the most negative lag (lag_1) and
min <- min(data$rel_year, na.rm = TRUE) + 1
max <- max(data$rel_year, na.rm = TRUE)
# make string for rel time indicators
indics <- paste0(c(paste0("`lag_", seq(min, -2), "`"),
paste0("lag_", seq(0, max))),
collapse = " + ")
# make the formula we want to run
form <- as.formula(paste0("y ~ ", indics, "| unit + year | 0 | unit"))
# get true taus to merge in
true_taus <- tibble(
time = seq(-10, 10),
true_tau = c(rep(0, length(-10:-1)), .2*(0:10 + 1))
)
# estimate the model and make the plot
data %>%
# generate the outcome variable
mutate(y = unit_fe + year_fe + cumtau + error) %>%
# run the model and tidy up
do(broom::tidy(felm(form, data = .), conf.int = TRUE)) %>%
filter(term != "firm_age") %>%
# make the relative time variable and keep just what we need
mutate(time = c(min:(-2), 0:max)) %>%
select(time, estimate, conf.low, conf.high) %>%
# bring back in missing indicator for t = -1
bind_rows(tibble(
time = -1, estimate = 0, conf.low = 0, conf.high = 0
)) %>%
# keep just -10 to 10
filter(time %>% between(-10, 10)) %>%
left_join(true_taus) %>%
# split the error bands by pre-post
mutate(band_groups = case_when(
time < -1 ~ "Pre",
time >= 0 ~ "Post",
time == -1 ~ ""
)) %>%
# plot it
ggplot(aes(x = time, y = estimate)) +
geom_line(aes(x = time, y = true_tau, color = "True Effect"), size = 1.5, linetype = "dashed") +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high, group = band_groups),
color = "lightgrey", alpha = 1/4) +
geom_pointrange(aes(ymin = conf.low, ymax = conf.high, color = "Estimated Effect"),
show.legend = FALSE) +
geom_hline(yintercept = 0) +
geom_vline(xintercept = -0.5, linetype = "dashed") +
scale_x_continuous(breaks = seq(-10, 10, by = 2)) +
labs(x = "Relative Time", y = "Estimate") +
scale_color_brewer(palette = 'Set1') +
theme(legend.position = "bottom",
legend.title = element_blank(),
axis.title = element_text(size = 18),
axis.text = element_text(size = 16))
2.3 加入公司年龄的对数
当公司年龄对数与结果变量和处理变量均不相关时,加入公司年龄的对数并不会影响我们的估计结果。具体如下图所示:
theme_set(theme_clean() + theme(plot.background = element_blank()))
# make string for rel time indicators
indics <- paste0(c(paste0("`lag_", seq(min, -2), "`"),
paste0("lag_", seq(0, max))),
collapse = " + ")
# make the formula we want to run
form <- as.formula(paste0("y ~ ", indics, " + log_age | unit + year | 0 | unit"))
# estimate the model and make the plot
data %>%
# generate the outcome variable
mutate(y = unit_fe + year_fe + cumtau + error) %>%
# run the model and tidy up
do(broom::tidy(felm(form, data = .), conf.int = TRUE)) %>%
filter(term != "log_age") %>%
# make the relative time variable and keep just what we need
mutate(time = c(min:(-2), 0:max)) %>%
select(time, estimate, conf.low, conf.high) %>%
# bring back in missing indicator for t = -1
bind_rows(tibble(
time = -1, estimate = 0, conf.low = 0, conf.high = 0
)) %>%
# keep just -10 to 10
filter(time %>% between(-10, 10)) %>%
left_join(true_taus) %>%
# split the error bands by pre-post
mutate(band_groups = case_when(
time < -1 ~ "Pre",
time >= 0 ~ "Post",
time == -1 ~ ""
)) %>%
# plot it
ggplot(aes(x = time, y = estimate)) +
geom_line(aes(x = time, y = true_tau, color = "True Effect"), size = 1.5, linetype = "dashed") +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high, group = band_groups),
color = "lightgrey", alpha = 1/4) +
geom_pointrange(aes(ymin = conf.low, ymax = conf.high, color = "Estimated Effect"),
show.legend = FALSE) +
geom_hline(yintercept = 0) +
geom_vline(xintercept = -0.5, linetype = "dashed") +
scale_x_continuous(breaks = seq(-10, 10, by = 2)) +
labs(x = "Relative Time", y = "Estimate") +
scale_color_brewer(palette = 'Set1') +
theme(legend.position = "bottom",
legend.title = element_blank(),
axis.title = element_text(size = 18),
axis.text = element_text(size = 16))
当公司年龄对数与结果变量相关,而与处理变量不相关时,加入公司年龄的对数仍然不会影响我们的估计结果。需要注意的是,这里我们其实并不需要控制 log_age,因为它不是一个混杂变量 (只影响结果变量)。
theme_set(theme_clean() + theme(plot.background = element_blank()))
# next, assume that there is an effect on the outcome variable but no effect on the treatment outcome
# make string for rel time indicators
indics <- paste0(c(paste0("`lag_", seq(min, -2), "`"),
paste0("lag_", seq(0, max))),
collapse = " + ")
form <- as.formula(paste0("y ~ ", indics, " + log_age| unit + year | 0 | unit"))
# estimate the model and make the plot
data %>%
# generate the outcome variable
mutate(y = unit_fe + year_fe + cumtau + -.85*log_age + error) %>%
# run the model and tidy up
do(broom::tidy(felm(form, data = .), conf.int = TRUE)) %>%
# make the relative time variable and keep just what we need
filter(term != "log_age") %>%
mutate(time = c(min:(-2), 0:max)) %>%
select(time, estimate, conf.low, conf.high) %>%
# bring back in missing indicator for t = -1
# bring back in missing indicator for t = -1
bind_rows(tibble(
time = -1, estimate = 0, conf.low = 0, conf.high = 0
)) %>%
# keep just -10 to 10
filter(time %>% between(-10, 10)) %>%
left_join(true_taus) %>%
# split the error bands by pre-post
mutate(band_groups = case_when(
time < -1 ~ "Pre",
time >= 0 ~ "Post",
time == -1 ~ ""
)) %>%
# plot it
ggplot(aes(x = time, y = estimate)) +
geom_line(aes(x = time, y = true_tau, color = "True Effect"), size = 1.5, linetype = "dashed") +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high, group = band_groups),
color = "lightgrey", alpha = 1/4) +
geom_pointrange(aes(ymin = conf.low, ymax = conf.high, color = "Estimated Effect"),
show.legend = FALSE) +
geom_hline(yintercept = 0) +
geom_vline(xintercept = -0.5, linetype = "dashed") +
scale_x_continuous(breaks = seq(-10, 10, by = 2)) +
labs(x = "Relative Time", y = "Estimate") +
scale_color_brewer(palette = 'Set1') +
theme(legend.position = "bottom",
legend.title = element_blank(),
axis.title = element_text(size = 18),
axis.text = element_text(size = 16))
当公司年龄对数与结果变量和处理变量都相关时,公司年龄对数就是一个混杂因素,此时是否需要将其纳入回归呢?我们通过以下示例可以看出,不加入公司年龄对数将得到一个有偏的估计。
theme_set(theme_clean() + theme(plot.background = element_blank()))
# Now assume that the log age variable is correlated in some way with the treatment assignment decision
# in particular assume that younger firms are more likely to get targeted
# Generate data - 20,000 firms are placed in each group. Groups 3 and 4 are
# treated in 1998, Groups 1 and 2 are untreated
make_data <- function(...) {
# Fixed Effects ------------------------------------------------
# unit fixed effects
unit <- tibble(
unit = 1:20000,
unit_fe = rnorm(20000, 0, 1)) %>%
# make first and last year per unit, and treat year if treated
rowwise() %>%
mutate(first_year = sample(seq(1981, 2010), 1),
# pull last year as a randomly selected date bw first and 2010
last_year = if_else(first_year < 2010, sample(seq(first_year, 2010), 1),
as.integer(2010))) %>%
ungroup()
# year fixed effects
year <- tibble(
year = 1981:2010,
year_fe = rnorm(30, 0, 1))
# make panel
data <- crossing(unit, year) %>%
arrange(unit, year) %>%
# keep only if year between first and last year
rowwise() %>%
filter(year %>% between(first_year, last_year)) %>%
ungroup() %>%
mutate(firm_age = year - first_year)
# make an average age data frame
avg_age <- data %>%
group_by(unit) %>%
summarize(avg_age = mean(firm_age)) %>%
mutate(weight = 1 / (avg_age + 1))
# sample 4,000 firms to get treatment, weighted by average age
treated_units <- sample_n(avg_age, 4000, replace = FALSE, weight = weight) %>%
mutate(treat = 1) %>%
select(unit, treat)
# merge treat back into the data
treat_data <- data %>%
select(unit, first_year, last_year) %>%
distinct() %>%
left_join(treated_units) %>%
replace_na(list(treat = 0)) %>%
rowwise() %>%
mutate(treat_year = if_else(treat == 1,
if_else(first_year != last_year,
sample(first_year:last_year, 1), as.integer(first_year)),
as.integer(0)))
# merge back into main data
data <- data %>%
left_join(treat_data) %>%
# make error term, treat term and log age term
mutate(error = rnorm(nrow(.), 0, 1),
posttreat = if_else(treat == 1 & year >= treat_year, 1, 0),
rel_year = if_else(treat == 1, year - treat_year, as.integer(NA)),
tau = if_else(posttreat == 1, .2, 0),
firm_age = year - first_year,
log_age = log(firm_age + 1)) %>%
# make cumulative treatment effects
group_by(unit) %>%
mutate(cumtau = cumsum(tau)) %>%
ungroup() %>%
# finally, make dummy variables
bind_cols(dummy_cols(tibble(lag = .$rel_year), select_columns = "lag",
ignore_na = TRUE, remove_selected_columns = TRUE)) %>%
# replace equal to 0 for all lead lag columns
mutate_at(vars(starts_with("lag_")), ~replace_na(., 0))
}
# make data
data2 <- make_data()
# get var names in a vector - need to drop the most negative lag (lag_1) and
min <- min(data2$rel_year, na.rm = TRUE) + 1
max <- max(data2$rel_year, na.rm = TRUE)
# make string for rel time indicators
indics <- paste0(c(paste0("`lag_", seq(min, -2), "`"),
paste0("lag_", seq(0, max))),
collapse = " + ")
# make the formula we want to run
form1 <- as.formula(paste0("y ~ ", indics, "| unit + year | 0 | unit"))
form2 <- as.formula(paste0("y ~ ", indics, " + log_age| unit + year | 0 | unit"))
# estimate the model and make the plot
data2 %>%
# generate the outcome variable
mutate(y = unit_fe + year_fe + cumtau + -.85*log_age + error) %>%
# run the model and tidy up
do(bind_rows(
broom::tidy(felm(form1, data = .), conf.int = TRUE) %>% mutate(mod = "No Control Log Age"),
broom::tidy(felm(form2, data = .), conf.int = TRUE) %>% mutate(mod = "Control Log Age"))) %>%
# make the relative time variable and keep just what we need
filter(term != "log_age") %>%
mutate(time = rep(c(min:(-2), 0:max), 2)) %>%
select(time, estimate, conf.low, conf.high, mod) %>%
# bring back in missing indicator for t = -1
bind_rows(tibble(
time = rep(-1, 2), estimate = rep(0, 2), conf.low = rep(0, 2),
conf.high = rep(0, 2), mod = c("No Control Log Age", "Control Log Age")
)) %>%
# keep just -10 to 10
filter(time %>% between(-10, 10)) %>%
left_join(true_taus) %>%
# split the error bands by pre-post
mutate(band_groups = case_when(
time < -1 ~ "Pre",
time >= 0 ~ "Post",
time == -1 ~ ""
)) %>%
# plot it
ggplot(aes(x = time, y = estimate)) +
geom_line(aes(x = time, y = true_tau, color = "True Effect"), size = 1.5, linetype = "dashed") +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high, group = band_groups),
color = "lightgrey", alpha = 1/4) +
geom_pointrange(aes(ymin = conf.low, ymax = conf.high, color = "Estimated Effect"),
show.legend = FALSE) +
geom_hline(yintercept = 0) +
geom_vline(xintercept = -0.5, linetype = "dashed") +
scale_x_continuous(breaks = seq(-10, 10, by = 2)) +
labs(x = "Relative Time", y = "Estimate") +
scale_color_brewer(palette = 'Set1') +
theme(legend.position = "bottom",
legend.title = element_blank(),
axis.title = element_text(size = 18),
axis.text = element_text(size = 16),
panel.spacing = unit(2, "lines")) +
facet_wrap(~mod)
3. Stata 中的实现
下面本文将通过 Stata 复现 R 中的操作,由于篇幅所限,本文仅复现第一个图像。为了保证结果的统一,本文使用 R 中生成的数据进行相关回归分析及图像绘制。具体代码如下:
* 回归分析
reghdfe y `x', absorb(unit year)
* 保存回归结果
matrix a2=r(table)
matrix bandse1=a2[1..2,1..9]
matlist bandse1
matrix bandse2=a2[1..2,24..34]
matlist bandse2
matrix bandse0=[0,0]' // 补充-1期
mat bandse = (bandse0,bandse1,bandse2)'
mat list bandse
clear
svmat bandse // 将矩阵转化为变量
* 修改变量名
rename bandse1 estimated
rename bandse2 se
* 生成时间变量
gen reltime= (-1)*_n in 1/10
replace reltime=_n-11 in 11/21
* 生成 coef.min 及 coef.max (95%的置信区间)
gen estmin=estimated-se*1.96
gen estmax=estimated+se*1.96
* 生成 real effect
gen realeffect= (reltime+1)*0.2 in 11/21
replace realeffect= 0 in 1/10
* plot it
sort reltime
graph twoway (rarea estmax estmin reltime, color(gs12)) ///
(scatter estimated reltime, msize(small) mcolor(maroon)) ///
(line realeffect reltime,lpattern(dash) lcolor(navy) lwidth(medium)), ///
xtitle("Relative Time") ytitle("Estimate") xticks(-10(2)10) ///
tline(-0.5, noextend lcolor(clack) lpattern(dash)) ///
yline(0, noextend lcolor(clack)) ///
legend(label(1 "95% confidence interval") ///
label(2 "Estimated Effect") label(3 "True Effect"))
4. 相关推文
Note:产生如下推文列表的 Stata 命令为:
lianxh 控制变量, m
安装最新版lianxh
命令:
ssc install lianxh, replace
专题:Stata命令 Stata:控制变量组合的筛选-tuples Stata新命令-pdslasso:众多控制变量和工具变量如何挑选? 专题:回归分析 调节效应是否需要考虑对控制变量交乘? 控制变量!控制变量! 不用太关心控制变量,真的! 加入控制变量后结果悲催了! 专题:IV-GMM Lasso一下:再多的控制变量和工具变量我也不怕-T217 专题:断点回归RDD RDD:断点回归可以加入控制变量吗? Stata:RDD-中可以加入控制变量 专题:其它 锚定情境法(一):有效控制变量自评偏差
课程推荐:因果推断实用计量方法
主讲老师:丘嘉平教授
🍓 课程主页:https://gitee.com/lianxh/YGqjp
New! Stata 搜索神器:
lianxh
和songbl
GIF 动图介绍
搜: 推文、数据分享、期刊论文、重现代码 ……
👉 安装:
. ssc install lianxh
. ssc install songbl
👉 使用:
. lianxh DID 倍分法
. songbl all
🍏 关于我们
连享会 ( www.lianxh.cn,推文列表) 由中山大学连玉君老师团队创办,定期分享实证分析经验。 直通车: 👉【百度一下: 连享会】即可直达连享会主页。亦可进一步添加 「知乎」,「b 站」,「面板数据」,「公开课」 等关键词细化搜索。